Character-Level Language Modeling with Deeper Self-Attention
LSTMs and other RNN variants have shown strong performance on character-level
language modeling. These models are typically trained using truncated
backpropagation through time, and it is common to assume that their success
stems from their ability to remember long-term contexts. In this paper, we show
that a deep (64-layer) transformer model with fixed context outperforms RNN
variants by a large margin, achieving state of the art on two popular
benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good
results at this depth, we show that it is important to add auxiliary losses,
both at intermediate network layers and intermediate sequence positions.
Comment: 8 pages, 7 figures
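The auxiliary-loss idea above can be sketched minimally: a character-level cross-entropy is computed at every sequence position, and intermediate layers contribute a down-weighted copy of that loss. This is a toy numpy illustration, not the paper's implementation; the fixed `aux_weight` is an assumption (the paper anneals auxiliary weights over training).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def char_ce(logits, targets):
    # Mean cross-entropy over ALL sequence positions (intermediate
    # positions are supervised, not just the final one).
    probs = softmax(logits)  # (seq, vocab)
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked + 1e-12)))

def total_loss(layer_logits, targets, aux_weight=0.5):
    # Final-layer loss plus down-weighted auxiliary losses computed
    # from every intermediate layer's prediction head.
    final = char_ce(layer_logits[-1], targets)
    aux = sum(char_ce(l, targets) for l in layer_logits[:-1])
    return final + aux_weight * aux
```

At 64 layers, these extra supervision signals are what make the deep stack trainable at all, per the abstract.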
LongT5: Efficient Text-To-Text Transformer for Long Sequences
Recent work has shown that either (1) increasing the input length or (2)
increasing model size can improve the performance of Transformer-based neural
models. In this paper, we present a new model, called LongT5, with which we
explore the effects of scaling both the input length and model size at the same
time. Specifically, we integrated attention ideas from long-input transformers
(ETC), and adopted pre-training strategies from summarization pre-training
(PEGASUS) into the scalable T5 architecture. The result is a new attention
mechanism we call {\em Transient Global} (TGlobal), which mimics ETC's
local/global attention mechanism, but without requiring additional side-inputs.
We are able to achieve state-of-the-art results on several summarization tasks
and outperform the original T5 models on question answering tasks.
Comment: Accepted in NAACL 2022
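The distinguishing feature of TGlobal attention is that each token attends to a local window plus "transient global" summary tokens that are built on the fly from blocks of the input, so no side-inputs are needed. The following is a minimal numpy sketch of that pattern under simplifying assumptions: no learned projections (queries, keys, and values are the inputs themselves), single head, and block means as the transient global tokens.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def tglobal_attention(x, local_radius=2, block_size=4):
    """Sketch of Transient Global attention on x of shape (seq, dim)."""
    seq, dim = x.shape
    n_blocks = seq // block_size
    # Transient global tokens: one mean-pooled summary per input block,
    # computed from the sequence itself rather than supplied externally.
    globals_ = x[: n_blocks * block_size].reshape(
        n_blocks, block_size, dim
    ).mean(axis=1)
    out = np.zeros_like(x)
    for i in range(seq):
        lo, hi = max(0, i - local_radius), min(seq, i + local_radius + 1)
        # Each query sees its local window plus every global summary.
        keys = np.concatenate([x[lo:hi], globals_], axis=0)
        w = softmax(x[i] @ keys.T / np.sqrt(dim))
        out[i] = w @ keys
    return out
```

Because each token attends to O(local_radius + seq/block_size) keys instead of O(seq), cost grows roughly linearly in sequence length, which is what makes scaling the input length tractable.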
Multilingual Universal Sentence Encoder for Semantic Retrieval
We introduce two pre-trained retrieval focused multilingual sentence encoding
models, respectively based on the Transformer and CNN model architectures. The
models embed text from 16 languages into a single semantic space using a
multi-task trained dual-encoder that learns tied representations using
translation based bridge tasks (Chidambaram et al., 2018). The models provide
performance that is competitive with the state-of-the-art on: semantic
retrieval (SR), translation pair bitext retrieval (BR) and retrieval question
answering (ReQA). On English transfer learning tasks, our sentence-level
embeddings approach, and in some cases exceed, the performance of monolingual,
English only, sentence embedding models. Our models are made available for
download on TensorFlow Hub.
Comment: 6 pages, 6 tables, 2 listings, and 1 figure
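The retrieval setup the abstract describes reduces to embedding queries and candidates into one shared space and ranking by dot product. Here is a self-contained sketch with a toy character-trigram hashing encoder standing in for the real dual encoder (the actual Transformer/CNN models live on TensorFlow Hub); only the retrieval-by-nearest-neighbor logic is the point.

```python
import numpy as np

def embed(texts, dim=64):
    # Toy stand-in for the sentence encoder: hash character trigrams
    # into a fixed-size vector, then L2-normalize so that dot product
    # equals cosine similarity.
    vecs = np.zeros((len(texts), dim))
    for r, t in enumerate(texts):
        for i in range(len(t) - 2):
            vecs[r, hash(t[i : i + 3]) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)

def retrieve(query, candidates):
    # Semantic retrieval: embed both sides with the same encoder and
    # return the index of the highest-scoring candidate.
    q = embed([query])        # (1, dim)
    c = embed(candidates)     # (n, dim)
    return int(np.argmax(c @ q.T))
```

With the real multilingual models, the same dot-product ranking works across the 16 supported languages because the dual encoder ties all of them to one semantic space.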
CoLT5: Faster Long-Range Transformers with Conditional Computation
Many natural language processing tasks benefit from long inputs, but
processing long documents with Transformers is expensive -- not only due to
quadratic attention complexity but also from applying feedforward and
projection layers to every token. However, not all tokens are equally
important, especially for longer documents. We propose CoLT5, a long-input
Transformer model that builds on this intuition by employing conditional
computation, devoting more resources to important tokens in both feedforward
and attention layers. We show that CoLT5 achieves stronger performance than
LongT5 with much faster training and inference, achieving SOTA on the
long-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably
make use of extremely long inputs, showing strong gains up to 64k input length.
Comment: Added CoDA reference and minor edits to clarify routing
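The conditional-computation idea above can be sketched in a few lines: every token passes through a cheap light branch, and only the top-k most important tokens additionally pass through an expensive heavy branch. This is a hypothetical illustration; in CoLT5 the importance scores come from a learned router, whereas here they are supplied externally.

```python
import numpy as np

def conditional_ffn(x, importance, k, heavy, light):
    """Route tokens of x (seq, dim) by importance (seq,).

    All tokens get the light feedforward branch; only the k most
    important tokens also receive the heavy branch, so cost for the
    heavy computation scales with k, not sequence length.
    """
    out = light(x)
    top = np.argsort(-importance)[:k]
    out[top] = out[top] + heavy(x[top])
    return out

# Usage with placeholder branches (real branches would be learned
# feedforward layers of different widths):
x = np.arange(12.0).reshape(6, 2)
importance = np.array([0.1, 0.9, 0.2, 0.8, 0.0, 0.3])
y = conditional_ffn(x, importance, k=2,
                    heavy=lambda t: t * 10.0,
                    light=lambda t: t * 0.5)
```

The same routing pattern applies to the attention layers, which is how CoLT5 keeps per-token cost low enough to reach 64k-token inputs.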